server: preserve primary KV cache when MTP companion trim fails #1889

Closed
localweights wants to merge 1 commit into ikawrakow:main from localweights:fix-mtp-companion-preserve-cache

Conversation

@localweights

Summary

The pre-batch reset path in server_context partially trims both the target ctx and (when MTP is enabled) its speculative companion ctx at p0 = system + n_past. The existing logic treats a failure of either trim as a reason to wipe cache_tokens, slot.n_past, n_prompt_tokens_cache, and the checkpoint list, and to reset the sampler.

Companion failures happen routinely after generation: unvalidated draft tokens leave the companion KV's position layout out of sync with the primary's. Sacrificing the primary cache for a recoverable mismatch confined to the draft ctx forces a full re-prefill on the next request, which defeats the point of prefix caching when MTP is on.

Change

Split the fallback into two paths:

  • target_trimmed && !companion_trimmed → wipe only the companion (it repopulates during the next prefill); leave the primary cache + checkpoints + sampler state intact.
  • !target_trimmed → unchanged conservative full reset (the original non-Transformer fall-through case that the existing comment alludes to).
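
As a rough illustration of the split above, here is a minimal, self-contained sketch of the decision logic. It is not the actual patch: `SlotState`, `TrimOutcome`, `handle_trim_result`, and the `companion_wiped` / `sampler_reset` flags are hypothetical stand-ins, and only the field names `cache_tokens`, `n_past`, and `n_prompt_tokens_cache` are taken from the description. That the full reset also discards the companion is an assumption.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-slot state named in the PR description.
// The real server structures are more involved; this models only the fields
// the fallback touches.
struct SlotState {
    std::vector<int32_t> cache_tokens;        // tokens currently held in the primary KV cache
    int32_t n_past = 0;                       // number of cached positions being reused
    int32_t n_prompt_tokens_cache = 0;        // prompt tokens served from cache
    std::vector<int32_t> checkpoints;         // placeholder for the checkpoint list
    bool sampler_reset = false;               // set when the sampler would be reset
    bool companion_wiped = false;             // set when the companion ctx would be cleared
};

enum class TrimOutcome { BothTrimmed, CompanionWiped, FullReset };

// Decide what to discard given the result of the two partial trims.
TrimOutcome handle_trim_result(SlotState & slot, bool target_trimmed, bool companion_trimmed) {
    if (!target_trimmed) {
        // Primary trim failed: conservative full reset, as before the change.
        slot.cache_tokens.clear();
        slot.n_past = 0;
        slot.n_prompt_tokens_cache = 0;
        slot.checkpoints.clear();
        slot.sampler_reset = true;
        slot.companion_wiped = true;          // assumption: a full reset discards the companion too
        return TrimOutcome::FullReset;
    }
    if (!companion_trimmed) {
        // Only the companion failed: wipe it and let the next prefill repopulate it.
        // Primary cache, checkpoints, and sampler state are left untouched.
        slot.companion_wiped = true;
        return TrimOutcome::CompanionWiped;
    }
    return TrimOutcome::BothTrimmed;          // nothing to discard
}
```

With this shape, a slot holding three cached tokens keeps all three and its `n_past` when called with `target_trimmed = true, companion_trimmed = false`, while `target_trimmed = false` empties it regardless of the companion result.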

Validation

Tested on Qwen3.6-27B + --multi-token-prediction --draft-max 3 + --reasoning on. Combined with #1888 (qwen3next checkpoint reuse), multi-pass synthesis goes from 0% prefix-cache reuse to 92% reuse on shared-prefix follow-up calls. Without this patch the companion-trim failure path still wiped the primary cache and undid the checkpoint fix.

Note

This patch addresses the pre-batch reset site that exists in current main. PR #1877 (Fix prompt cache viability) introduces a similar trim-fallback at the second post-prefix-match site; the same split should be applied there when that PR lands.

The pre-batch reset path in server-context partially trims both the
target ctx and (if MTP is enabled) its speculative companion ctx at
p0 = system + n_past. Either failure currently triggers a full reset
that nukes cache_tokens, slot.n_past, n_prompt_tokens_cache, the
checkpoint list, and the sampler state.

Companion failures are common after generation because unvalidated
draft tokens leave the companion KV's position layout out of sync
with the primary's. Sacrificing the primary cache for that recoverable
mismatch forces a full re-prefill on the next request, even though the
primary KV trim succeeded.

This change splits the fallback: when only the companion fails, wipe
just the companion (it repopulates during the next prefill) and keep
the primary cache + checkpoints intact. The full-reset path remains
in place for when the primary itself fails to trim (non-Transformer
fall-through case the comment alludes to).

Validated on Qwen3.6-27B + --multi-token-prediction --draft-max 3: 92%
prefix-cache reuse on multi-pass synthesis vs 0% before this change.
@ikawrakow

Owner

Can you provide a reproduction where trimming one context succeeds but trimming the other fails?

@ikawrakow

Owner

Add an issue with reproduction. After that you can resubmit the PR.

@ikawrakow ikawrakow closed this May 28, 2026